When using to_sql(), continue if duplicate primary keys are detected? #15988
Comments
This should support the "on duplicate update" mode as well.
@rosstripi I think the idea to have this would certainly be accepted, but AFAIK the main bottleneck is an implementation for this using sql/sqlalchemy in a flavor-agnostic way. Some exploration of how this could be done is certainly welcome!
Hi, did you figure out any workaround for this? Please let me know.
Any update on this implementation? I am now facing this problem with PostgreSQL and SQLAlchemy and would love to have that "on duplicate update". Thanks for the work.
A workaround would be to remove the unique index in the database before appending, then just let your MySQL server add the index again and drop the duplicates. Depending on your specific application, this can be helpful.
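A minimal sketch of that index dance, assuming MariaDB (or MySQL older than 5.7.4, which still supports `ALTER IGNORE TABLE`); the table, index, and connection names are hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://user:pass@localhost/db')  # hypothetical DSN
df = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']})

with engine.begin() as conn:
    # drop the unique index so the append cannot collide
    conn.execute(text('ALTER TABLE my_table DROP INDEX my_unique_idx'))
    df.to_sql('my_table', conn, if_exists='append', index=False)
    # re-adding the index with IGNORE silently drops the duplicate rows
    # (MariaDB / MySQL < 5.7.4 only; later MySQL removed ALTER IGNORE)
    conn.execute(text('ALTER IGNORE TABLE my_table ADD UNIQUE INDEX my_unique_idx (id)'))
```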
yes, append_skipdupes +1
Agreed that it would be good to be able to deal with this with options in `to_sql()`. Here's the workaround I use in SQLite:

```sql
CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...
);
```

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.
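A sketch of how that combines with `to_sql()`, using sqlite3 directly; the table and column names are made up, and the `ON CONFLICT IGNORE` clause is SQLite-specific:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect('my.db')
conn.execute('''CREATE TABLE IF NOT EXISTS my_table_name (
                    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
                    value TEXT)''')

df = pd.DataFrame({'some_kind_of_id': [1, 2], 'value': ['a', 'b']})
# rows whose primary key already exists are silently ignored by SQLite itself,
# so the append no longer aborts on duplicates
df.to_sql('my_table_name', conn, if_exists='append', index=False)
conn.commit()
```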
Another workaround with MariaDB and MySQL: `LOAD DATA` is much faster than `INSERT`.
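A minimal sketch of that idea: dump the frame to CSV, then bulk-load it with `LOAD DATA ... IGNORE`, which skips rows whose key already exists. It assumes a MySQL/MariaDB server with `local_infile` enabled; all names are hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://user:pass@localhost/db?local_infile=1')  # hypothetical DSN
df = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']})

df.to_csv('/tmp/my_table.csv', index=False, header=False)  # dump the frame first
with engine.begin() as conn:
    # IGNORE makes the server skip rows that would duplicate an existing key
    conn.execute(text(
        "LOAD DATA LOCAL INFILE '/tmp/my_table.csv' "
        "IGNORE INTO TABLE my_table "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
    ))
```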
append_skipdupes +1
+1 for append_skipdupes
Agree, 'append_skipdupes' should be added.
Yes, please. 'append_skipdupes' should be added, and not only for the primary key column. If there are duplicates among other unique columns, it should skip appending those new duplicate rows as well.
+1 for append_skipdupes
append_skipdupes +1
append_skipdupes +1
+1 for append_skipdupes
In the meantime you can use this: https://pypi.org/project/pangres/
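For example, a sketch based on pangres's `upsert` helper: it treats the DataFrame's named index as the primary key, and `if_row_exists='ignore'` gives the skip-duplicates behavior (check the pangres docs for the exact signature in your version; all other names here are hypothetical):

```python
import pandas as pd
from pangres import upsert
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@localhost/db')  # hypothetical DSN

df = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']}).set_index('id')
# 'ignore' skips rows whose primary key already exists; 'update' overwrites them
upsert(con=engine, df=df, table_name='my_table', if_row_exists='ignore')
```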
+1 for append_skipdupes
+1 for append_skipdupes
+1 for append_skipdupes. IMO, an option to update the duplicates would also be nice. Perhaps `append_updatedupes`.
+1
I have made a small script for my use that allows INSERT IGNORE in MySQL. NOTE: this is copy-pasted from my Database class; please adjust for your use!

```python
import random
import string

import pandas as pd


def save_dataframe(self, df: pd.DataFrame, table: str):
    '''
    Save a dataframe to the database.
    The index is saved if it has a name; if its name is None it is not saved.
    It implements INSERT IGNORE when inserting rows into the MySQL table.
    The table needs to exist beforehand.

    Arguments:
        df {pd.DataFrame} -- dataframe to save
        table {str} -- name of the db table
    '''
    save_index = df.index.name is not None
    self._insert_conflict_ignore(df=df, table=table, index=save_index)


def _insert_conflict_ignore(self, df: pd.DataFrame, table: str, index: bool):
    """
    Save a dataframe to the MySQL database with an 'INSERT IGNORE' query.

    First it uses pandas.to_sql to save to a temporary table.
    After that it uses SQL to transfer the data to the destination table,
    matching the columns. The destination table needs to exist already.
    The final step is deleting the temporary table.

    Parameters
    ----------
    df : pd.DataFrame
        dataframe to save
    table : str
        destination table name
    """
    # generate a random table name for concurrent writing
    temp_table = ''.join(random.choice(string.ascii_letters) for _ in range(10))
    try:
        df.to_sql(temp_table, self.conn, index=index)
        columns = self._table_column_names(table=temp_table)
        insert_query = f'INSERT IGNORE INTO {table}({columns}) SELECT {columns} FROM `{temp_table}`'
        self.conn.execute(insert_query)
    except Exception as e:
        print(e)
    finally:
        # drop the temporary table whether or not the insert succeeded
        drop_query = f'DROP TABLE IF EXISTS `{temp_table}`'
        self.conn.execute(drop_query)


def _table_column_names(self, table: str) -> str:
    """
    Get column names from a database table.

    Parameters
    ----------
    table : str
        name of the table

    Returns
    -------
    str
        backtick-quoted column names, joined so they can be
        interpolated into the SQL queries
    """
    query = f"SELECT column_name FROM information_schema.columns WHERE table_name = '{table}'"
    rows = self.conn.execute(query)
    dirty_names = [i[0] for i in rows]
    clean_names = '`' + '`, `'.join(map(str, dirty_names)) + '`'
    return clean_names
```

https://gist.github.com/tombohub/0c666583c48c1686c736ae2eb76cb2ea
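Usage would look roughly like this, assuming a `Database` class that holds the connection as `self.conn` and exposes the methods above (the constructor here is hypothetical):

```python
import pandas as pd

db = Database('mysql+pymysql://user:pass@localhost/db')  # hypothetical constructor
df = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']}).set_index('id')
db.save_dataframe(df, table='my_table')  # rows with existing keys are ignored
```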
+1 for append_skipdupes
Rather than upvoting this issue, which already has a lot of votes, someone could help with this PR: #29636
Instead of only skipping duplicates, an option to choose between skipping and updating them would be nice.
Is there a PostgreSQL equivalent for this?
Unfortunately, I couldn't find an equivalent for this in PostgreSQL when creating the table. (You can use conflict handling in insert or update commands, but that's not the case here.)
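That said, PostgreSQL's insert-time `ON CONFLICT DO NOTHING` can be wired into `to_sql()` through its `method` parameter, which accepts a callable with the signature `(pd_table, conn, keys, data_iter)`; a sketch with hypothetical table and connection names:

```python
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.dialects.postgresql import insert


def insert_do_nothing(pd_table, conn, keys, data_iter):
    # build one INSERT ... ON CONFLICT DO NOTHING statement per chunk
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(pd_table.table).values(rows).on_conflict_do_nothing()
    conn.execute(stmt)


engine = create_engine('postgresql+psycopg2://user:pass@localhost/db')  # hypothetical DSN
df = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']})
df.to_sql('my_table', engine, if_exists='append', index=False, method=insert_do_nothing)
```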
A problem not solved from 2017 to 2022?
+1 for append_skipdupes
+1 for append_skipdupes
+1 for append_skipdupes
+1 for append_skipdupes
+1 for append_skipdupes
Is this issue resolved yet?
Does anyone know if it's ever planned to be added?
The official discussion in #49246 suggests that this issue is not a current focus point of pandas; I suggest closing. #15988 (comment) from tombohub seems to be a workaround for it. For …
Perhaps it would be nice to add those …
I am guessing it works on a batch update basis …
I believe this method would beautifully solve the problem until a native function is built into the project.
Code Sample, a copy-pastable example if possible
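A minimal sketch of the failing pattern described below (table and connection names are hypothetical):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:pass@localhost/db')  # hypothetical DSN
df = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})

# if any 'id' already exists in the table, the call raises IntegrityError
# and the append stops entirely
df.to_sql('my_table', engine, if_exists='append', index=False)
```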
Problem description
I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame are duplicates of those in the SQL table, some are not. But `to_sql()` completely stops executing if even one duplicate is detected.

It would make sense for `to_sql(if_exists='append')` to merely warn the user which rows had duplicate keys and just continue to add the new rows, not completely stop executing. For large datasets, you will likely have duplicates but want to ignore them.

Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional `if_exists` option like `'append_skipdupes'`?

Output of `pd.show_versions()`:
```
pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None
```