pyspark.sql.DataFrame.unionByName
DataFrame.unionByName(other: pyspark.sql.dataframe.DataFrame, allowMissingColumns: bool = False) → pyspark.sql.dataframe.DataFrame

Returns a new DataFrame containing the union of rows in this and another DataFrame.

This is different from both UNION ALL and UNION DISTINCT in SQL: columns are resolved by name rather than by position, and duplicate rows are kept. To do a SQL-style set union (that deduplicates elements), use this function followed by distinct().

New in version 2.3.0.
Changed in version 3.4.0: Supports Spark Connect.
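Parameters
    other : DataFrame
        The DataFrame to combine with this one.
    allowMissingColumns : bool, optional, default False
        Whether the set of column names in this and the other DataFrame may differ. When True, missing columns are filled with null.
Returns
    DataFrame
        Combined DataFrame.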
Examples
The difference between this function and union() is that this function resolves columns by name (not by position):

>>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
>>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
>>> df1.unionByName(df2).show()
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+
When the parameter allowMissingColumns is True, the set of column names in this and the other DataFrame can differ; missing columns will be filled with null. Further, the missing columns of this DataFrame will be added at the end of the schema of the union result:

>>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
>>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col3"])
>>> df1.unionByName(df2, allowMissingColumns=True).show()
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
|   1|   2|   3|null|
|null|   4|   5|   6|
+----+----+----+----+