When using to_sql(), continue if duplicate primary keys are detected? #15988

Closed
rosstripi opened this issue Apr 12, 2017 · 41 comments · Fixed by #53264
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query

Comments

@rosstripi

Code Sample, a copy-pastable example if possible

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)

Problem description

I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame are duplicates of those in the SQL table, some are not. But to_sql() completely stops executing if even one duplicate is detected.

It would make sense for to_sql(if_exists='append') to merely warn the user which rows had duplicate keys and just continue to add the new rows, not completely stop executing. For large datasets, you will likely have duplicates but want to ignore them.

Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional if_exists option like 'append_skipdupes'?
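For illustration, a hypothetical call with the proposed option might look like this (note: 'append_skipdupes' does not exist in pandas today, it is only the suggestion above):

# Hypothetical usage of the proposed option; 'append_skipdupes' is not an existing if_exists value.
df.to_sql('TableNameHere', engine, if_exists='append_skipdupes', chunksize=900, index=False)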

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: English_United States.1252

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

@rockg
Contributor

rockg commented Apr 13, 2017

This should support an "on duplicate update" mode as well.

@jorisvandenbossche jorisvandenbossche added the IO SQL to_sql, read_sql, read_sql_query label Apr 13, 2017
@jorisvandenbossche
Member

@rosstripi I think the idea to have this would certainly be accepted, but AFAIK the main bottleneck is an implementation of this using sql/sqlalchemy in a flavor-agnostic way. Some exploration of how this could be done is certainly welcome!

@muniswamy89

Hi, did you figure out any workaround for this? Please let me know.

@AlvaroPica

Any update on this implementation?

I am now facing this problem with PostgreSQL and SQLAlchemy and would love to have that "on duplicate update".

Thanks for the work

@valewyss

A workaround would be to remove the unique index in the database:

sqlquery = "ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"

afterwards
df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)
can be executed.

Then just let your MySQL server add the index again and drop the duplicates:
sqlquery = "ALTER IGNORE TABLE `DATABASE`.`TABLE` ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC, `column_name3` [ASC | DESC])"

Depending on your specific application, this can be helpful.
Anyway, an if_exists option like append_skipdupes would be much better.
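A minimal sketch of that drop-and-recreate-index workaround with SQLAlchemy (the DSN, schema, table, index, and column names are placeholders; note that ALTER IGNORE TABLE was removed in MySQL 5.7, so on newer servers you would have to deduplicate another way):

# Sketch of the drop-index / re-add-index workaround; all names below are placeholders.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@host/DATABASE")  # placeholder DSN
df = pd.DataFrame({"column_name1": [1, 2], "column_name2": ["a", "b"]})  # example data

# 1. drop the unique index so the append cannot fail on duplicates
with engine.begin() as conn:
    conn.execute(text("ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"))

# 2. append as usual
df.to_sql('TABLE', engine, schema='DATABASE', if_exists='append', chunksize=900, index=False)

# 3. re-add the unique index; IGNORE silently drops duplicate rows (MySQL < 5.7 only)
with engine.begin() as conn:
    conn.execute(text(
        "ALTER IGNORE TABLE `DATABASE`.`TABLE` "
        "ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC)"
    ))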

@cgi1

cgi1 commented May 14, 2019

append_skipdupes would be the perfect way to handle this.

@macdet

macdet commented Jun 28, 2019

yes, append_skipdupes +1

@jtkiley
Contributor

jtkiley commented Aug 6, 2019

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.
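A small, self-contained sketch of that approach with the standard-library sqlite3 driver (the table and column names are made up for the example):

# Sketch: pre-create the table with ON CONFLICT IGNORE, then append with to_sql.
import sqlite3
import pandas as pd

conn = sqlite3.connect("example.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    value TEXT
)
""")

df = pd.DataFrame({"some_kind_of_id": [1, 2, 2], "value": ["a", "b", "b"]})
df.to_sql("my_table_name", conn, if_exists="append", index=False)  # duplicate ids are silently skipped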

@netchose

netchose commented Oct 24, 2019

Another workaround with MariaDB and MySQL:
df.to_csv("test.csv")
then use:
LOAD DATA INFILE 'test.csv' IGNORE INTO TABLE mytable or
LOAD DATA INFILE 'test.csv' REPLACE INTO TABLE mytable.

LOAD DATA is much faster than INSERT.

Complete code:

from pathlib import Path

# application_path, db, table_name and df come from your own configuration
csv_path = str(Path(application_path) / "tmp" / "tmp.csv").replace("\\", "\\\\")
df.to_csv(csv_path, index=False, sep='\t', quotechar="'", na_rep=r'\N')
rq = """LOAD DATA LOCAL INFILE '{file_path}' REPLACE INTO TABLE {db}.{db_table}
        LINES TERMINATED BY '\\r\\n'
        IGNORE 1 LINES
        ({col});
        """.format(db=db,
                   file_path=csv_path,
                   db_table=table_name,
                   col=",".join(df.columns.tolist()))

@kjford
Contributor

kjford commented Dec 9, 2019

I believe this is being addressed in #29636 with the upsert_ignore argument, which addresses #14553.

@iveteran

append_skipdupes +1

@grantog

grantog commented Aug 23, 2020

+1 for append_skipdupes

@Arham-Aalam

Agreed, 'append_skipdupes' should be added.

@rahullak

Yes, please. 'append_skipdupes' should be added, and not only for the primary key column. If there are duplicates among other unique columns, it should also skip appending those new duplicate rows.

@devashishnyati

+1 for append_skipdupes

@rishabh-vij

append_skipdupes +1

1 similar comment
@mc55boy

mc55boy commented Nov 21, 2020

append_skipdupes +1

@IsraaMa

IsraaMa commented Nov 29, 2020

+1 for append_skipdupes

@BuSHari

BuSHari commented Nov 29, 2020

In the meantime you can use this: https://pypi.org/project/pangres/

@kxbin

kxbin commented Jan 19, 2021

+1 for append_skipdupes

1 similar comment
@frostless

+1 for append_skipdupes

@singhal2

singhal2 commented Apr 7, 2021

+1 for append_skipdupes. IMO, an option to update the duplicates would also be nice. Perhaps append_updatedupes.

@tombohub

+1

@tombohub

tombohub commented Apr 20, 2021

I have made a small script for my own use to allow INSERT IGNORE in MySQL:

NOTE: This is copied and pasted from my Database class, please adjust it for your use!

    def save_dataframe(self, df: pd.DataFrame, table: str):
        '''
        Save dataframe to the database. 
        Index is saved if it has name. If it's None it will not be saved.
        It implements INSERT IGNORE when inserting rows into the MySQL table.
        Table needs to exist before. 

        Arguments:
            df {pd.DataFrame} -- dataframe to save
            table {str} -- name of the db table
        '''
        if df.index.name is None:
            save_index = False
        else:
            save_index = True

        self._insert_conflict_ignore(df=df, table=table, index=save_index)

   
    def _insert_conflict_ignore(self, df: pd.DataFrame, table: str, index: bool):
        """
        Saves dataframe to the MySQL database with 'INSERT IGNORE' query.
        
        First it uses pandas.to_sql to save to temporary table.
        After that it uses SQL to transfer the data to destination table, matching the columns.
        Destination table needs to exist already. 
        Final step is deleting the temporary table.

        Parameters
        ----------
        df : pd.DataFrame
            dataframe to save
        table : str
            destination table name
        """
        # generate random table name for concurrent writing
        temp_table = ''.join(random.choice(string.ascii_letters) for i in range(10))
        try:
            df.to_sql(temp_table, self.conn, index=index)
            columns = self._table_column_names(table=temp_table)
            insert_query = f'INSERT IGNORE INTO {table}({columns}) SELECT {columns} FROM `{temp_table}`'
            self.conn.execute(insert_query)
        except Exception as e:
            print(e)        

        # drop temp table
        drop_query = f'DROP TABLE IF EXISTS `{temp_table}`'
        self.conn.execute(drop_query)


    def _table_column_names(self, table: str) -> str:
        """
        Get column names from database table

        Parameters
        ----------
        table : str
            name of the table

        Returns
        -------
        str
            names of columns as a string so we can interpolate into the SQL queries
        """
        query = f"SELECT column_name FROM information_schema.columns WHERE table_name = '{table}'"
        rows = self.conn.execute(query)
        dirty_names = [i[0] for i in rows]
        clean_names = '`' + '`, `'.join(map(str, dirty_names)) + '`'
        return clean_names

https://gist.github.com/tombohub/0c666583c48c1686c736ae2eb76cb2ea
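A hedged usage sketch for the snippet above (Database here stands for your own class holding those methods, with self.conn being a SQLAlchemy engine/connection; the module-level imports the methods rely on are included):

# Imports the methods above rely on, plus an example call; Database is your own class.
import random
import string
import pandas as pd

db = Database()  # assumed: your class containing save_dataframe() and the helpers above
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
db.save_dataframe(df, table="my_table")  # rows whose keys already exist in my_table are ignored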

@tinglinliu

+1 for append_skipdupes

@hyamanieu

Rather than upvoting this issue, which already has a lot of votes, someone could help with this PR: #29636

@Mausy5043

Instead of only skipping duplicates, an option to choose between raise, ignore, and replace would be even better. This way you can choose to have an exception raised, skip the duplicate, or have the duplicate row removed and the new data inserted.

@benboughton1

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.

Is there a postgresql equivalent for this?

@keivanipchihagh

keivanipchihagh commented May 20, 2022

Agreed that it would be good to be able to deal with this with options in df.to_sql().
Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.

Is there a postgresql equivalent for this?

Unfortunately, I couldn't find an equivalent for this on PostgreSQL when creating the table. (You can use conflict handling in INSERT or UPDATE commands, but that's not the case here.)
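One way to get the same effect on PostgreSQL without touching the table definition is to_sql's method callable combined with SQLAlchemy's INSERT ... ON CONFLICT DO NOTHING. A minimal sketch, assuming a SQLAlchemy engine and an existing target table with a primary key (names are placeholders):

# Sketch: skip duplicate primary keys on PostgreSQL via to_sql(method=...).
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.dialects.postgresql import insert

def insert_on_conflict_nothing(table, conn, keys, data_iter):
    # table is pandas' SQLTable wrapper; table.table is the underlying SQLAlchemy Table
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(rows).on_conflict_do_nothing()
    return conn.execute(stmt).rowcount

engine = create_engine("postgresql+psycopg2://user:password@host/db")  # placeholder DSN
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
df.to_sql("my_table", engine, if_exists="append", index=False,
          method=insert_on_conflict_nothing)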

@redreamality

A problem not solved from 2017 to 2022?
if_exists operates at the table level; an extra keyword argument like 'skip_duplicates' would also be acceptable.

@bruppfab

bruppfab commented Jul 5, 2022

+1 for append_skipdupes

4 similar comments
@chabazite

+1 for append_skipdupes

@Magnum35puc

+1 for append_skipdupes

@perofskite

+1 for append_skipdupes

@ThomasAuriel

+1 for append_skipdupes

@seanjedi
Contributor

seanjedi commented Dec 1, 2022

Is this issue resolved yet?

@motishaku

Does anyone know if it's ever planned to be added?

@redreamality

The official discussion in #49246 suggests that this issue is not a current focus of pandas; I suggest closing.

#15988 (comment) from tombohub seems like a workaround for it.

For upsert, just replace the INSERT IGNORE statement.
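For example, in the temp-table transfer query from that comment, the INSERT IGNORE could be swapped for a MySQL upsert (a sketch; `value` stands for whichever non-key columns you want updated, and table/columns/temp_table come from the earlier snippet):

# Sketch: upsert variant of the temp-table transfer query; names are placeholders.
insert_query = (
    f"INSERT INTO {table} ({columns}) "
    f"SELECT {columns} FROM `{temp_table}` "
    f"ON DUPLICATE KEY UPDATE value = VALUES(value)"  # repeat for each non-key column
)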

@felixmarch

Perhaps it would be nice to add those upsert and INSERT IGNORE helpers (#15988 (comment)) as pandas utility functions? 🤔

@nono-london

I am guessing it works on a batch-update basis,
so the easiest might be to have the user decide with "batch=False" and "ignore_error_on_duplicate_key=True".
These two are handleable in most databases.
"Update on duplicate" is usually more tailor-made to each database.
Just a thought, and thank you for the lib, which is great!
Best

@keivanipchihagh

keivanipchihagh commented May 2, 2023

I believe this method would beautifully solve the problem until a native function is built into the project.
