pyspark.sql.DataFrame.exists#
- DataFrame.exists()[source]#
Return a Column object for an EXISTS Subquery.
The exists method provides a way to create a boolean column that checks for the presence of related records in a subquery. When applied within a DataFrame, this method allows you to filter rows based on whether matching records exist in the related dataset. The resulting Column object can be used directly in filtering conditions or as a computed column.
New in version 4.0.0.
- Returns
Column
A Column object representing an EXISTS subquery
Examples
Setup sample data for customers and orders.
>>> data_customers = [ ... (101, "Alice", "USA"), (102, "Bob", "Canada"), (103, "Charlie", "USA"), ... (104, "David", "Australia") ... ] >>> data_orders = [ ... (1, 101, "2023-01-15", 250), (2, 102, "2023-01-20", 300), ... (3, 103, "2023-01-25", 400), (4, 101, "2023-02-05", 150) ... ] >>> customers = spark.createDataFrame( ... data_customers, ["customer_id", "customer_name", "country"]) >>> orders = spark.createDataFrame( ... data_orders, ["order_id", "customer_id", "order_date", "total_amount"])
Example 1: Filter for customers who have placed at least one order.
>>> from pyspark.sql import functions as sf >>> customers.alias("c").where( ... orders.alias("o").where( ... sf.col("o.customer_id") == sf.col("c.customer_id").outer() ... ).exists() ... ).orderBy("customer_id").show() +-----------+-------------+-------+ |customer_id|customer_name|country| +-----------+-------------+-------+ | 101| Alice| USA| | 102| Bob| Canada| | 103| Charlie| USA| +-----------+-------------+-------+
Example 2: Filter for customers who have never placed an order.
>>> from pyspark.sql import functions as sf >>> customers.alias("c").where( ... ~orders.alias("o").where( ... sf.col("o.customer_id") == sf.col("c.customer_id").outer() ... ).exists() ... ).orderBy("customer_id").show() +-----------+-------------+---------+ |customer_id|customer_name| country| +-----------+-------------+---------+ | 104| David|Australia| +-----------+-------------+---------+
Example 3: Find Orders from Customers in the USA.
>>> from pyspark.sql import functions as sf >>> orders.alias("o").where( ... customers.alias("c").where( ... (sf.col("c.customer_id") == sf.col("o.customer_id").outer()) ... & (sf.col("country") == "USA") ... ).exists() ... ).orderBy("order_id").show() +--------+-----------+----------+------------+ |order_id|customer_id|order_date|total_amount| +--------+-----------+----------+------------+ | 1| 101|2023-01-15| 250| | 3| 103|2023-01-25| 400| | 4| 101|2023-02-05| 150| +--------+-----------+----------+------------+