pyspark.sql.DataFrame.subtract

DataFrame.subtract(other)

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
other : DataFrame

The DataFrame whose rows are subtracted from this DataFrame.

Returns
DataFrame

A new DataFrame containing the rows of this DataFrame that do not appear in other, with duplicate rows removed.

See also

DataFrame.exceptAll

Similar to subtract, but preserves duplicates.

Notes

This is equivalent to EXCEPT DISTINCT in SQL: rows that also appear in other are removed, and duplicate rows are dropped from the result (see Example 4 below for the contrast with DataFrame.exceptAll).

Examples

Example 1: Subtracting two DataFrames with the same schema

>>> df1 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
>>> df2 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])
>>> result_df = df1.subtract(df2)
>>> result_df.show()
+---+---+
| C1| C2|
+---+---+
|  c|  4|
+---+---+

Example 2: Subtracting two DataFrames with partially overlapping rows

>>> df1 = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "value"])
>>> df2 = spark.createDataFrame([(2, "B"), (3, "C")], ["id", "value"])
>>> result_df = df1.subtract(df2)
>>> result_df.show()
+---+-----+
| id|value|
+---+-----+
|  1|    A|
+---+-----+

Example 3: Subtracting two DataFrames with mismatched column names. Columns are matched by position rather than by name, so the row (1, 2) on each side cancels out and the result is empty, keeping df1's column names.

>>> df1 = spark.createDataFrame([(1, 2)], ["A", "B"])
>>> df2 = spark.createDataFrame([(1, 2)], ["C", "D"])
>>> result_df = df1.subtract(df2)
>>> result_df.show()
+---+---+
|  A|  B|
+---+---+
+---+---+
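
Example 4: Duplicate rows are removed from the result. The following sketch (assuming an active SparkSession named spark, as in the examples above; row order in show() output may vary) illustrates the EXCEPT DISTINCT semantics and the contrast with DataFrame.exceptAll, which preserves duplicates.

>>> df1 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 2)], ["C1", "C2"])
>>> df2 = spark.createDataFrame([("b", 2)], ["C1", "C2"])
>>> df1.subtract(df2).show()   # both ("a", 1) rows collapse into one
+---+---+
| C1| C2|
+---+---+
|  a|  1|
+---+---+
>>> df1.exceptAll(df2).show()  # exceptAll keeps both ("a", 1) rows
+---+---+
| C1| C2|
+---+---+
|  a|  1|
|  a|  1|
+---+---+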