pyspark.sql.DataFrame.subtract

DataFrame.subtract(other)

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
other : DataFrame

The DataFrame whose rows are subtracted from this DataFrame.

Returns
DataFrame

A new DataFrame containing the rows of this DataFrame that do not appear in other, with duplicate rows removed.

See also

DataFrame.exceptAll

Similar to subtract, but preserves duplicates.

Notes

This is equivalent to EXCEPT DISTINCT in SQL: rows that also appear in other are removed, and duplicate rows are dropped from the result (see Example 4 below for the contrast with DataFrame.exceptAll).

Examples

Example 1: Subtracting two DataFrames with the same schema

>>> df1 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
>>> df2 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])
>>> result_df = df1.subtract(df2)
>>> result_df.show()
+---+---+
| C1| C2|
+---+---+
|  c|  4|
+---+---+

Example 2: Subtracting two DataFrames with partially overlapping rows

>>> df1 = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "value"])
>>> df2 = spark.createDataFrame([(2, "B"), (3, "C")], ["id", "value"])
>>> result_df = df1.subtract(df2)
>>> result_df.show()
+---+-----+
| id|value|
+---+-----+
|  1|    A|
+---+-----+

Example 3: Subtracting two DataFrames with mismatched column names. Columns are matched by position rather than by name, so the row (1, 2) on each side cancels out and the result is empty, keeping df1's column names.

>>> df1 = spark.createDataFrame([(1, 2)], ["A", "B"])
>>> df2 = spark.createDataFrame([(1, 2)], ["C", "D"])
>>> result_df = df1.subtract(df2)
>>> result_df.show()
+---+---+
|  A|  B|
+---+---+
+---+---+
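
Example 4: Duplicate rows are removed from the result. The following sketch (assuming an active SparkSession named spark, as in the examples above; row order in show() output may vary) illustrates the EXCEPT DISTINCT semantics and the contrast with DataFrame.exceptAll, which preserves duplicates.

>>> df1 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 2)], ["C1", "C2"])
>>> df2 = spark.createDataFrame([("b", 2)], ["C1", "C2"])
>>> df1.subtract(df2).show()   # both ("a", 1) rows collapse into one
+---+---+
| C1| C2|
+---+---+
|  a|  1|
+---+---+
>>> df1.exceptAll(df2).show()  # exceptAll keeps both ("a", 1) rows
+---+---+
| C1| C2|
+---+---+
|  a|  1|
|  a|  1|
+---+---+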