pyspark.sql.DataFrame.exists#

DataFrame.exists()[source]#

Return a Column object for an EXISTS Subquery.

The exists method provides a way to create a boolean column that checks for the presence of related records in a subquery. When applied within a DataFrame, this method allows you to filter rows based on whether matching records exist in the related dataset. The resulting Column object can be used directly in filtering conditions or as a computed column.

New in version 4.0.0.

Returns
Column

A Column object representing an EXISTS subquery

Examples

Setup sample data for customers and orders.

>>> data_customers = [
...     (101, "Alice", "USA"), (102, "Bob", "Canada"), (103, "Charlie", "USA"),
...     (104, "David", "Australia")
... ]
>>> data_orders = [
...     (1, 101, "2023-01-15", 250), (2, 102, "2023-01-20", 300),
...     (3, 103, "2023-01-25", 400), (4, 101, "2023-02-05", 150)
... ]
>>> customers = spark.createDataFrame(
...     data_customers, ["customer_id", "customer_name", "country"])
>>> orders = spark.createDataFrame(
...     data_orders, ["order_id", "customer_id", "order_date", "total_amount"])

Example 1: Filter for customers who have placed at least one order.

>>> from pyspark.sql import functions as sf
>>> customers.alias("c").where(
...     orders.alias("o").where(
...         sf.col("o.customer_id") == sf.col("c.customer_id").outer()
...     ).exists()
... ).orderBy("customer_id").show()
+-----------+-------------+-------+
|customer_id|customer_name|country|
+-----------+-------------+-------+
|        101|        Alice|    USA|
|        102|          Bob| Canada|
|        103|      Charlie|    USA|
+-----------+-------------+-------+

Example 2: Filter for customers who have never placed an order.

>>> from pyspark.sql import functions as sf
>>> customers.alias("c").where(
...     ~orders.alias("o").where(
...         sf.col("o.customer_id") == sf.col("c.customer_id").outer()
...     ).exists()
... ).orderBy("customer_id").show()
+-----------+-------------+---------+
|customer_id|customer_name|  country|
+-----------+-------------+---------+
|        104|        David|Australia|
+-----------+-------------+---------+

Example 3: Find Orders from Customers in the USA.

>>> from pyspark.sql import functions as sf
>>> orders.alias("o").where(
...     customers.alias("c").where(
...         (sf.col("c.customer_id") == sf.col("o.customer_id").outer())
...         & (sf.col("country") == "USA")
...     ).exists()
... ).orderBy("order_id").show()
+--------+-----------+----------+------------+
|order_id|customer_id|order_date|total_amount|
+--------+-----------+----------+------------+
|       1|        101|2023-01-15|         250|
|       3|        103|2023-01-25|         400|
|       4|        101|2023-02-05|         150|
+--------+-----------+----------+------------+