When using to_sql(), continue if duplicate primary keys are detected? #15988

Closed
rosstripi opened this issue Apr 12, 2017 · 41 comments · Fixed by #53264
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query

Comments

@rosstripi

Code Sample, a copy-pastable example if possible

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)

Problem description

I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame are duplicates of those in the SQL table, some are not. But to_sql() completely stops executing if even one duplicate is detected.

It would make sense for to_sql(if_exists='append') to merely warn the user which rows had duplicate keys and just continue to add the new rows, not completely stop executing. For large datasets, you will likely have duplicates but want to ignore them.

Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional if_exists option like 'append_skipdupes'?
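For illustration, a hypothetical call with the proposed option might look like this (note: 'append_skipdupes' does not exist in pandas today, it is only the suggestion above):

# Hypothetical usage of the proposed option; 'append_skipdupes' is not an existing if_exists value.
df.to_sql('TableNameHere', engine, if_exists='append_skipdupes', chunksize=900, index=False)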

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: English_United States.1252

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

@rockg
Contributor

rockg commented Apr 13, 2017

This should support an "on duplicate update" mode as well.

@jorisvandenbossche jorisvandenbossche added the IO SQL to_sql, read_sql, read_sql_query label Apr 13, 2017
@jorisvandenbossche
Member

@rosstripi I think the idea to have this would certainly be accepted, but AFAIK the main bottleneck is an implementation of this using sql/sqlalchemy in a flavor-agnostic way. Some exploration of how this could be done is certainly welcome!

@muniswamy89

Hi, did you figure out any workaround for this? Please let me know.

@AlvaroPica

Any update on this implementation?

I am now facing this problem with PostgreSQL and SQLAlchemy and would love to have that "on duplicate update".

Thanks for the work

@valewyss

A workaround would be to remove the unique index in the database:

sqlquery = "ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"

afterwards
df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)
can be executed.

Then just let your MySQL server add the index again and drop the duplicates:
sqlquery = "ALTER IGNORE TABLE `DATABASE`.`TABLE` ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC, `column_name3` [ASC | DESC])"

Depending on your specific application, this can be helpful.
Anyway, an if_exists option like append_skipdupes would be much better.
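A minimal sketch of that drop-and-recreate-index workaround with SQLAlchemy (the DSN, schema, table, index, and column names are placeholders; note that ALTER IGNORE TABLE was removed in MySQL 5.7, so on newer servers you would have to deduplicate another way):

# Sketch of the drop-index / re-add-index workaround; all names below are placeholders.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@host/DATABASE")  # placeholder DSN
df = pd.DataFrame({"column_name1": [1, 2], "column_name2": ["a", "b"]})  # example data

# 1. drop the unique index so the append cannot fail on duplicates
with engine.begin() as conn:
    conn.execute(text("ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"))

# 2. append as usual
df.to_sql('TABLE', engine, schema='DATABASE', if_exists='append', chunksize=900, index=False)

# 3. re-add the unique index; IGNORE silently drops duplicate rows (MySQL < 5.7 only)
with engine.begin() as conn:
    conn.execute(text(
        "ALTER IGNORE TABLE `DATABASE`.`TABLE` "
        "ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC)"
    ))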

@cgi1

cgi1 commented May 14, 2019

append_skipdupes would be the perfect way to handle this.

@macdet

macdet commented Jun 28, 2019

yes, append_skipdupes +1

@jtkiley
Contributor

jtkiley commented Aug 6, 2019

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.
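A small, self-contained sketch of that approach with the standard-library sqlite3 driver (the table and column names are made up for the example):

# Sketch: pre-create the table with ON CONFLICT IGNORE, then append with to_sql.
import sqlite3
import pandas as pd

conn = sqlite3.connect("example.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    value TEXT
)
""")

df = pd.DataFrame({"some_kind_of_id": [1, 2, 2], "value": ["a", "b", "b"]})
df.to_sql("my_table_name", conn, if_exists="append", index=False)  # duplicate ids are silently skipped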

@netchose

netchose commented Oct 24, 2019

Another workaround with MariaDB and MySQL:
df.to_csv("test.csv")
then use:
LOAD DATA INFILE 'test.csv' IGNORE INTO TABLE mytable or
LOAD DATA INFILE 'test.csv' REPLACE INTO TABLE mytable.

LOAD DATA is much faster than INSERT.

Complete code:

from pathlib import Path

# application_path, db, table_name and df come from your own configuration
csv_path = str(Path(application_path) / "tmp" / "tmp.csv").replace("\\", "\\\\")
df.to_csv(csv_path, index=False, sep='\t', quotechar="'", na_rep=r'\N')
rq = """LOAD DATA LOCAL INFILE '{file_path}' REPLACE INTO TABLE {db}.{db_table}
        LINES TERMINATED BY '\\r\\n'
        IGNORE 1 LINES
        ({col});
        """.format(db=db,
                   file_path=csv_path,
                   db_table=table_name,
                   col=",".join(df.columns.tolist()))

@kjford
Contributor

kjford commented Dec 9, 2019

I believe this is being addressed in #29636 with the upsert_ignore argument, which addresses #14553.

@iveteran

append_skipdupes +1

@grantog

grantog commented Aug 23, 2020

+1 for append_skipdupes

@Arham-Aalam

Agreed, 'append_skipdupes' should be added.

@rahullak

Yes, please. 'append_skipdupes' should be added, and not only for the primary key column. If there are duplicates among other unique columns, it should also skip appending those new duplicate rows.

@devashishnyati

+1 for append_skipdupes

@rishabh-vij

append_skipdupes +1

1 similar comment
@mc55boy

mc55boy commented Nov 21, 2020

append_skipdupes +1

@IsraaMa

IsraaMa commented Nov 29, 2020

+1 for append_skipdupes

@BuSHari

BuSHari commented Nov 29, 2020

In the meantime you can use this: https://pypi.org/project/pangres/

@kxbin

kxbin commented Jan 19, 2021

+1 for append_skipdupes

1 similar comment
@frostless

+1 for append_skipdupes

@singhal2

singhal2 commented Apr 7, 2021

+1 for append_skipdupes. IMO, an option to update the duplicates would also be nice. Perhaps append_updatedupes.

@tombohub

+1

@tombohub

tombohub commented Apr 20, 2021

I have made a small script for my own use to allow INSERT IGNORE in MySQL:

NOTE: This is copied and pasted from my Database class, please adjust it for your use!

    def save_dataframe(self, df: pd.DataFrame, table: str):
        '''
        Save dataframe to the database. 
        Index is saved if it has name. If it's None it will not be saved.
        It implements INSERT IGNORE when inserting rows into the MySQL table.
        Table needs to exist before. 

        Arguments:
            df {pd.DataFrame} -- dataframe to save
            table {str} -- name of the db table
        '''
        if df.index.name is None:
            save_index = False
        else:
            save_index = True

        self._insert_conflict_ignore(df=df, table=table, index=save_index)

   
    def _insert_conflict_ignore(self, df: pd.DataFrame, table: str, index: bool):
        """
        Saves dataframe to the MySQL database with 'INSERT IGNORE' query.
        
        First it uses pandas.to_sql to save to temporary table.
        After that it uses SQL to transfer the data to destination table, matching the columns.
        Destination table needs to exist already. 
        Final step is deleting the temporary table.

        Parameters
        ----------
        df : pd.DataFrame
            dataframe to save
        table : str
            destination table name
        """
        # generate random table name for concurrent writing
        temp_table = ''.join(random.choice(string.ascii_letters) for i in range(10))
        try:
            df.to_sql(temp_table, self.conn, index=index)
            columns = self._table_column_names(table=temp_table)
            insert_query = f'INSERT IGNORE INTO {table}({columns}) SELECT {columns} FROM `{temp_table}`'
            self.conn.execute(insert_query)
        except Exception as e:
            print(e)        

        # drop temp table
        drop_query = f'DROP TABLE IF EXISTS `{temp_table}`'
        self.conn.execute(drop_query)


    def _table_column_names(self, table: str) -> str:
        """
        Get column names from database table

        Parameters
        ----------
        table : str
            name of the table

        Returns
        -------
        str
            names of columns as a string so we can interpolate into the SQL queries
        """
        query = f"SELECT column_name FROM information_schema.columns WHERE table_name = '{table}'"
        rows = self.conn.execute(query)
        dirty_names = [i[0] for i in rows]
        clean_names = '`' + '`, `'.join(map(str, dirty_names)) + '`'
        return clean_names

https://gist.github.com/tombohub/0c666583c48c1686c736ae2eb76cb2ea
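A hedged usage sketch for the snippet above (Database here stands for your own class holding those methods, with self.conn being a SQLAlchemy engine/connection; the module-level imports the methods rely on are included):

# Imports the methods above rely on, plus an example call; Database is your own class.
import random
import string
import pandas as pd

db = Database()  # assumed: your class containing save_dataframe() and the helpers above
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
db.save_dataframe(df, table="my_table")  # rows whose keys already exist in my_table are ignored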

@tinglinliu

+1 for append_skipdupes

@hyamanieu

Rather than upvoting this issue, which already has a lot of votes, someone could help with this PR: #29636

@Mausy5043

Instead of only skipping duplicates, an option to choose between raise, ignore, and replace would be even better. This way you can choose to have an exception raised, skip the duplicate, or have the duplicate row removed and the new data inserted.

@benboughton1

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.

Is there a postgresql equivalent for this?

@keivanipchihagh

keivanipchihagh commented May 20, 2022

Agreed that it would be good to be able to deal with this with options in df.to_sql().
Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.

Is there a postgresql equivalent for this?

Unfortunately, I couldn't find an equivalent for this on PostgreSQL when creating the table. (You can use conflict handling in INSERT or UPDATE commands, but that's not the case here.)
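One way to get the same effect on PostgreSQL without touching the table definition is to_sql's method callable combined with SQLAlchemy's INSERT ... ON CONFLICT DO NOTHING. A minimal sketch, assuming a SQLAlchemy engine and an existing target table with a primary key (names are placeholders):

# Sketch: skip duplicate primary keys on PostgreSQL via to_sql(method=...).
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.dialects.postgresql import insert

def insert_on_conflict_nothing(table, conn, keys, data_iter):
    # table is pandas' SQLTable wrapper; table.table is the underlying SQLAlchemy Table
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(rows).on_conflict_do_nothing()
    return conn.execute(stmt).rowcount

engine = create_engine("postgresql+psycopg2://user:password@host/db")  # placeholder DSN
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
df.to_sql("my_table", engine, if_exists="append", index=False,
          method=insert_on_conflict_nothing)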

@redreamality

A problem not solved from 2017 to 2022?
if_exists operates at the table level; an extra keyword argument like 'skip_duplicates' would also be acceptable.

@bruppfab

bruppfab commented Jul 5, 2022

+1 for append_skipdupes

4 similar comments
@chabazite

+1 for append_skipdupes

@Magnum35puc

+1 for append_skipdupes

@perofskite

+1 for append_skipdupes

@ThomasAuriel

+1 for append_skipdupes

@seanjedi
Contributor

seanjedi commented Dec 1, 2022

Is this issue resolved yet?

@motishaku

Does anyone know if it's ever planned to be added?

@redreamality

The official discussion in #49246 suggests that this issue is not a current focus of pandas; I suggest closing.

#15988 (comment) from tombohub seems like a workaround for it.

For upsert, just replace the INSERT IGNORE statement.
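For example, in the temp-table transfer query from that comment, the INSERT IGNORE could be swapped for a MySQL upsert (a sketch; `value` stands for whichever non-key columns you want updated, and table/columns/temp_table come from the earlier snippet):

# Sketch: upsert variant of the temp-table transfer query; names are placeholders.
insert_query = (
    f"INSERT INTO {table} ({columns}) "
    f"SELECT {columns} FROM `{temp_table}` "
    f"ON DUPLICATE KEY UPDATE value = VALUES(value)"  # repeat for each non-key column
)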

@felixmarch

Perhaps it would be nice to add those upsert and INSERT IGNORE helpers (#15988 (comment)) as pandas utility functions? 🤔

@nono-london

I am guessing it works on a batch-update basis,
so the easiest might be to have the user decide with "batch=False" and "ignore_error_on_duplicate_key=True".
These two are handleable in most databases.
"Update on duplicate" is usually more tailor-made to each database.
Just a thought, and thank you for the lib, which is great!
Best

@keivanipchihagh

keivanipchihagh commented May 2, 2023

I believe this method would beautifully solve the problem until a native function is built into the project.
